Use task scheduler to process optimistic responses and execute error rollback #4641
base: main
Conversation
Hi @kyle-painter! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
For further context, I've only observed this issue in a local app environment (using a task scheduler enforcing batched updates) and had very little success trying to debug which magic series of inputs/outputs leads to the error. I also attempted to create a minimal reproduction from the relay-examples repo, again with no luck. To briefly summarise, we have a connection of rows in a table and we execute a mutation to create a new row:

1. The new row is optimistically appended to the connection using
2. The mutation is executed.
3. A network error occurs and rollback begins. During the rollback a synchronous update is flushed, causing the table to be re-rendered. Interestingly, the table connection still includes 4 edges, but it crashes while trying to call
A very obscure issue to try and debug. The rendering behaviour appeared to change based on which fields I specified within my Row fragment. What I did narrow down, however, is that rollback works as expected when a successful GraphQL payload is returned, and that when the configured task scheduler is removed it fails in the same fashion. This is how I landed on the above outcome. We could also consider moving the scheduler down specifically to where the rollback is executed during
Force-pushed from e42c442 to 56ad111.
Ok after a lot of debugging I was able to narrow down a reproduction which can affect both successful and failed network responses. There are a couple of key components at play:
Configuring a batched task scheduler resolves this issue for the success path; however, it still crashes during a failed network response because the rollback doesn't leverage the configured scheduler. While these may be unorthodox conditions to meet, I do believe using the task scheduler to perform these rollbacks fundamentally makes sense, consistent with how payloads are currently processed.
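For readers arriving here, the "batched task scheduler" setup can be sketched without any Relay imports. The `createBatchedScheduler` helper below is illustrative, not code from this PR; it assumes Relay's `TaskScheduler` contract of `schedule(task) -> id` / `cancel(id)`, and defers all tasks queued in the same tick to a single microtask so they flush together.

```javascript
// Hypothetical batched scheduler matching Relay's TaskScheduler shape
// ({schedule, cancel}). Tasks scheduled in the same tick are flushed in
// one pass instead of running synchronously as they arrive.
function createBatchedScheduler(flush) {
  let queue = [];
  let nextId = 0;
  let scheduled = false;
  return {
    schedule(task) {
      const id = String(nextId++);
      queue.push({id, task});
      if (!scheduled) {
        scheduled = true;
        // In a real app the flush callback would be something like
        // ReactDOM.unstable_batchedUpdates; queueMicrotask keeps this
        // sketch self-contained and runnable.
        queueMicrotask(() => {
          scheduled = false;
          const tasks = queue;
          queue = [];
          flush(() => tasks.forEach(({task}) => task()));
        });
      }
      return id;
    },
    cancel(id) {
      queue = queue.filter(entry => entry.id !== id);
    },
  };
}
```

In an app this would be handed to the `Environment` constructor's `scheduler` option, e.g. `new Environment({network, store, scheduler: createBatchedScheduler(ReactDOM.unstable_batchedUpdates)})` (option name per recent relay-runtime versions).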
cc @captbaritone for your review/feedback 😄
@kyle-painter Thanks for reporting this, digging in, and working on a fix. Sorry I haven't gotten back to you sooner. I'll spend some time today reviewing and looking at the case you've collected. As I'm sure you've noticed, changes like this can be very hard to reason about. One thing we might want to do is put the change behind a feature flag so that we can roll it out and quickly opt back into the old behavior if we observe any issues. If we can confirm that we don't notice any issues, we can remove the feature flag. Runtime feature flags are currently implemented as a global singleton here. You can add a new option and then conditionally use the scheduler depending upon what value is set. Thanks again for your investigation here and apologies for the lagging response.
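The flag-gating suggested above could look roughly like the sketch below. `RelayFeatureFlags` here is a local stand-in for relay-runtime's global singleton, and `ENABLE_SCHEDULED_OPTIMISTIC_ROLLBACK` is a hypothetical flag name, not necessarily the one the PR ends up using.

```javascript
// Local stand-in for relay-runtime's RelayFeatureFlags singleton; the
// flag name is invented for illustration.
const RelayFeatureFlags = {
  ENABLE_SCHEDULED_OPTIMISTIC_ROLLBACK: false,
};

// Defer through the scheduler only when the flag is on and a scheduler
// is configured; otherwise keep the legacy synchronous path so the old
// behavior is one flag-flip away.
function scheduleOrRun(scheduler, task) {
  if (
    RelayFeatureFlags.ENABLE_SCHEDULED_OPTIMISTIC_ROLLBACK &&
    scheduler != null
  ) {
    scheduler.schedule(task);
  } else {
    task();
  }
}
```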
@captbaritone has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Importing this to play around with it a bit. Still trying to ensure I can fully wrap my head around the implications here.
      optimisticConfig.updater,
      false,
    );
    this._schedule(() => {
I've noticed we don't have tests which validate this code path (all tests still pass with this wrapper removed).
Ah right, looks like the feature flag is set in an earlier test run and not restored:

relay/packages/relay-runtime/store/__tests__/RelayModernEnvironment-ExecuteMutation-test.js, line 556 in 56ad111:

RelayFeatureFlags.PROCESS_OPTIMISTIC_UPDATE_BEFORE_SUBSCRIPTION = true;
I'll update the new tests to run with all relevant feature flag states.
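One way to keep flag state from leaking between test runs, as happened above, is a save/restore wrapper. This is a standalone Node sketch, not the actual test change: `RelayFeatureFlags` below is a local stand-in for the real singleton, and `withFeatureFlag` is a hypothetical helper.

```javascript
// Local stand-in for the global RelayFeatureFlags singleton; the flag
// name mirrors the test file referenced above.
const RelayFeatureFlags = {
  PROCESS_OPTIMISTIC_UPDATE_BEFORE_SUBSCRIPTION: false,
};

// Run `body` with the flag set to `value`, then restore the previous
// value even if the body throws, so later tests see a clean singleton.
function withFeatureFlag(name, value, body) {
  const previous = RelayFeatureFlags[name];
  RelayFeatureFlags[name] = value;
  try {
    return body();
  } finally {
    RelayFeatureFlags[name] = previous;
  }
}

// Exercise the scenario under every relevant flag state.
for (const value of [true, false]) {
  withFeatureFlag('PROCESS_OPTIMISTIC_UPDATE_BEFORE_SUBSCRIPTION', value, () => {
    // ...run the mutation scenario under test here...
  });
}
```

In Jest the same idea is usually expressed with `beforeEach`/`afterEach` around a parameterized suite.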
No problem @captbaritone, I'll try and get the changes behind a feature flag some time this week.
Force-pushed from 56ad111 to 74de55f.
> So, the bad update happens on the rollback in response to the mutation error. During the rollback, the optimistically added list item is removed. This makes sense to me.
I'm a little confused by this sentence. I would expect that if a parent (containing the connection) rerenders before the children, it would observe the new state and we would get a correct reflow where it rerenders the new set of children with the removed list item not present. I was expecting the cause here to be the inverse: for example, if a child is earlier in the subscriptions array than its parent, and we notify in order and unbatched, we would rerender the component that was rendering the now-deleted item (triggering the error) before the parent had a chance to rerender and cause that child to unmount. Am I reading your summary correctly? If the parent really is rerendering first, but with the stale connection items, do you know what's leading the parent to get rerendered with the stale connection data?

I'm also trying to wrap my head around why React's auto-batching is not helping us here. My understanding is that promise resolution should be auto-batched as of React 18. I tried upgrading your example project to React 18 and the error still seems to reproduce. I can continue to dig on both of these fronts (thanks again for your detailed report with repro!) but I figured I'd ask directly in case you already had developed an understanding of the cause during your investigation.
In the example app, both the parent and the child retrieve the connection from the fragment separately. The parent in this instance does have the correct connection data, however I believe when the child reads the fragment it's serving the stale, cached connection.
If the
Hmm, I haven't attempted to upgrade myself. Did you update the app to use
Ok I've given it a test with
Ah yeah, I had forgotten that the implicit batching was dependent upon using createRoot. I can confirm that once I did that, I was also unable to reproduce the issue.
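The subscription-order race discussed in this thread can be illustrated with a small standalone simulation. Everything below is invented for illustration (no React or Relay involved): the "child" subscriber sits before its "parent" in the subscription list, unbatched notification makes the still-mounted child read a record that rollback has just deleted, and a batched flush is modeled as a single top-down pass in which the parent reconciles first.

```javascript
// Toy model of the race: a child component subscribed before its parent
// reads a record that error rollback has just deleted.
function simulate(batched) {
  const records = {
    'row-1': {label: 'one'},
    'row-2': {label: 'two'},
    'optimistic-row': {label: 'pending'},
  };
  let connection = ['row-1', 'row-2', 'optimistic-row'];
  let mountedChildren = [...connection];

  const renderChild = id => {
    if (records[id] == null) {
      // Mirrors the crash in the repro: a still-mounted row reads fields
      // of a record removed by the rollback.
      throw new Error(`Cannot read deleted record ${id}`);
    }
    return records[id].label;
  };

  const subscribers = [
    // Child subscribed first: re-renders every currently mounted row.
    () => mountedChildren.forEach(renderChild),
    // Parent subscribed second: reconciles which children stay mounted.
    () => { mountedChildren = [...connection]; },
  ];

  // Error rollback: drop the optimistic edge and its record.
  connection = connection.filter(id => id !== 'optimistic-row');
  delete records['optimistic-row'];

  if (batched) {
    // Batching lets the renderer process the update top-down in a single
    // pass, so the parent unmounts the stale row before any child renders.
    // Modeled here by notifying the parent first.
    [...subscribers].reverse().forEach(cb => cb());
  } else {
    // Unbatched: subscribers fire in subscription order, so the child
    // renders the deleted row before the parent can unmount it.
    subscribers.forEach(cb => cb());
  }
  return mountedChildren;
}
```

This is only a model of why `createRoot`'s automatic batching hides the symptom; the PR's point stands that Relay's own rollback path should go through the configured scheduler rather than relying on the renderer to batch.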
Operations in Relay are flushed synchronously by default, which has been observed to cause potential race conditions in the client app. In practice this is very difficult to reproduce and is dependent on various factors that may affect the component render order of the application.

An effective way to mitigate these potential errors is to configure a task scheduler to batch updates; however, the OperationExecutor doesn't use the scheduler when processing optimistic responses, or during rollback after a network error, meaning these events are still susceptible to obscure race condition errors. This PR uses the configured scheduler in these scenarios to help alleviate this class of error.

Example of an invalid render due to such an error:
Some additional internal feedback from @tomgasson regarding these changes:
On that note, it would be excellent to get additional feedback from the maintainers on whether this unexpected behaviour has been observed elsewhere, and whether there are other approaches you'd suggest to avoid or solve this scenario.